Mega Weaver: A Simple Iterative Approach for BAC Consensus Assembly

Authors

  • Daolong Wang
  • Mario Lauria
  • Bo Yuan
  • Fred A. Wright
Abstract

Hierarchical genome assembly can be divided into three distinct stages: sequencing and assembling shotgun reads for each of a series of selected BAC clones; assembling the resulting fragments into BAC consensus sequences; and mapping and orienting the BAC consensus sequences according to external positional information. We report a new approach for BAC consensus assembly that relies on iterative layouts of overlapping sequence, with no need for prior masking of repetitive sequence. The approach includes major steps of quality filtering and an iterative screening algorithm within and between clusters of overlapping BAC fragments. Each step includes numerous minor steps designed to detect false overlaps at minimal expense in true overlaps. In contrast to dynamic algorithms, our approach attempts to minimize false overlaps before attempting to form BAC consensus sequences. We show that false overlaps are reduced to such a degree that final BAC consensus assembly is straightforward under a coordinate system described in the paper. Using human chromosome 22 and a range of simulation conditions, an average of 98.1% of false overlaps were removed, while 6.7% of true overlaps went undetected. The final assembled BAC consensus sequences were nearly optimal and support the usefulness of our approach for future hierarchical sequencing projects.
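The two percentages quoted in the abstract can be read as rates over a set of candidate overlaps whose true status is known, as in a simulation. The sketch below is only an illustration of that bookkeeping, not the authors' code: the function name `screening_rates`, the representation of an overlap as a pair of fragment IDs, and the toy data are all assumptions made here.

```python
def screening_rates(candidates, true_overlaps, retained):
    """Return (false_removed_rate, true_missed_rate) for one screening run.

    candidates    -- all overlaps proposed before screening
    true_overlaps -- overlaps known to be real (e.g. from a simulation)
    retained      -- overlaps that survived screening
    All three are iterables of hashable overlap identifiers (hypothetical
    representation; the paper does not prescribe one here).
    """
    candidates = set(candidates)
    true_set = candidates & set(true_overlaps)
    false_set = candidates - true_set
    kept = candidates & set(retained)

    # Fraction of false overlaps that screening removed.
    false_removed = len(false_set - kept) / len(false_set) if false_set else 0.0
    # Fraction of true overlaps that screening lost.
    true_missed = len(true_set - kept) / len(true_set) if true_set else 0.0
    return false_removed, true_missed


# Toy data (made up): overlaps between fragments labelled a..e.
candidates = {("a", "b"), ("b", "c"), ("a", "d"), ("c", "e"), ("d", "e")}
true_overlaps = {("a", "b"), ("b", "c")}
retained = {("a", "b"), ("a", "d")}

rates = screening_rates(candidates, true_overlaps, retained)
print(rates)
```

The figures in the abstract are averages over a range of simulation conditions, so a comparable number would come from averaging these per-run rates across runs.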


Similar resources

The Atlas genome assembly system.

Atlas is a suite of programs developed for assembly of genomes by a "combined approach" that uses DNA sequence reads from both BACs and whole-genome shotgun (WGS) libraries. The BAC clones afford advantages of localized assembly with reduced computational load, and provide a robust method for dealing with repeated sequences. Inclusion of WGS sequences facilitates use of different clone insert s...


Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the mega-reads algorithm

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data...


Estimating the repeat structure and length of DNA sequences using L-tuples.

In shotgun sequencing projects, the genome or BAC length is not always known. We approach estimating genome length by first estimating the repeat structure of the genome or BAC, sometimes of interest in its own right, on the basis of a set of random reads from a genome project. Moreover, we can find the consensus for repeat families before assembly. Our methods are based on the l-tuple content ...


Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data...


CAP3: A DNA sequence assembly program.

We describe the third generation of the CAP sequence assembly program. The CAP3 program includes a number of improvements and new features. The program has a capability to clip 5' and 3' low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also...



Publication date: 2004